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1 Abstract 



Global mobility flow data are at the heart of spatial epidemiological models used to predict 
infectious disease behavior but this wealth of data on human mobility has been largely 
neglected by reconstructions of pathogen evolutionary dynamics using viral genetic data. 
Although stochastic models of viral evolution may potentially be informed by such data, 
a major challenge lies in deciding which mobility processes are critical and to what extent 
they contribute to shaping contemporaneous distributions of pathogen diversity. Here, we 
develop a framework to integrate predictors of viral diffusion with phylogeographic inference 
and estimate human influenza H3N2 migration history while simultaneously testing and 
quantifying the factors that underly it. We provide evidence for air travel governing the 
global dynamics of human influenza whereas other prOCGSSGS cLCt dbt a more local scale. 



2 Introduction 

Global public health is repeatedly and increasingly challenged by the emergence of high- 
impact pathogens pjj. Novel influenza strains, severe acute respiratory syndrome (SARS) 
virus and Methicillin-resistant Staphylococcus aureus represent only a few examples of pathogens 
that exploited today's complex and voluminous human traffic and mobility to rapidly dis- 
seminate in our globalized world. 

The worldwide air transportation network is by far the most extensively studied mobility 
system in the context of human infectious disease dynamics [2]. Indeed, air travel represents 
an obvious driving force for the global circulation of seasonal influenza A (H3N2) viruses, 
and may explain the absence of locally persistent strains in between epidemic seasons [3j. 
Retrospective modeling of the 1968 'Hong Kong flu' pandemic spread demonstrated that 
the H3N2 virus diffused through a network of global cities interconnected by air travel 
P]. Numerous modeling studies have subsequently examined the influence of air travel on 
influenza spread (e.g. O El E]), but far less work has attempted to verify such models 
against underlying patterns of host movement [9] . 

Two studies on the timing and rate of seasonal influenza spread across the United States 
have highlighted the difficulty of resorting to standard epidemiological data to disentan- 
gle the contributions of different human transportation systems to influenza spread. Using 
weekly time series of excess mortalities between 1972 and 2002, Viboud et al. (2006) [9] 
demonstrated that the patterns of timing and incidence across the continental United States 
are significantly associated with euclidean distance and various measures of domestic trans- 
portation (including airline travel), but most strongly with rates of movement of people to 
and from their workplaces. In the same year, Brownstein et al. (2006) [10] demonstrated that 
the rate of inter-regional spread and timing of influenza in the United States, as measured 
using weekly influenza and pneumonia mortality statistics from 1996 to 2005, is predicted 
by domestic airline travel volume in November. Because both studies implicated a different 
key driver of seasonal influenza spread across the United States, the findings were subject of 
debate [TT] , in particular in the light of a mounting threat of an influenza pandemic [13] 
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and the need for decisions on implementing travel restrictions. Insights at the global scale are 
also revealed by simulation studies (e.g, [H]), and empirical analysis using global mortality 
data may prove even more challenging. 

As a historical record of epidemic spread, viral genetic data may offer a valuable alter- 
native for empirical verification of epidemic models. The power of fitting statistical models 
of evolution to observed sequence data has clearly been demonstrated by a number of sem- 
inal studies, e.g. by revealing the genetic dynamics of influenza A H3N2 seasonality [12] 
and spatial patterns of global H3N2 circulation |3l HE] . More generally, viral phylogenetic 
and epidemiological insights have culminated into a phylodynamic framework that unifies 
evolutionary and ecological dynamics to explain patterns of viral diversity [17]. Although 
model-based inference is increasingly used to reconstruct viral diffusion through time and 
space, these attempts typically fit parameter-rich models to sparse spatial data, and post-hoc 
interpretation of phylogeographic patterns are then difficult to relate directly to underlying 
ecological and evolutionary processes [18] . 

Here, we present a novel model-based approach to simultaneously reconstruct spatio- 
temporal history and test the contribution of potential diffusion predictors. Our phylo- 
geographic reconstruction considers discrete sampling locations defined by geographical and 
administrative boundaries as well as air communities identified through direct analysis of the 
global air transportation network. By parameterizing the discrete diffusion process in terms 
of the inclusion and contribution of predictors, our approach generally requires considerably 
fewer parameters compared to standard phylogeographic inference. We demonstrate how 
this model allows for the integration of viral genetic data and human mobility measures to 
draw inference about key drivers of global influenza dynamics. 

3 Methods 

3.1 Sequence and location data 

We compiled 1441 hemagglutinin sequences with known date and location of sampling pre- 
viously obtained by [3]. These sequences were sampled globally from 2002 to 2007 and are 
representative of a larger samphng (13,000 isolates) used for antigenic analysis [3]. We ex- 
plored different spatial and air travel-assisted subdivisions with subsampling to examine the 
impact of discrete sampling allocation and sample numbers per locations on our phylogeo- 
graphic estimates. In an attempt to include all sequence data while keeping the number 
of samples per location as balanced as possible, we first divided all the sequences into 26 
geographic regions (Table [g. Since these spatial partitions sometimes required arbitrary 
subdivisions (e.g. breaking up USA, China, Japan and Australia), we also applied a discrete 
location scheme that accommodated a single location for these spatially and administratively 
coherent regions, arriving at 15 geographic regions (Table |2|. Within each sampling year, 
we randomly down-sampled the five locations with the highest number of samples relative to 
their population size (USA: from 278 to 150; Australia: from 166 to 30; New Zealand: from 
59 to 20; Japan: from 341 to 75; South Korea: from 51 to 30) and analyzed three different 
subsampled data sets. 

Because passenger flux emerged as the main predictor in our phylogeographic model (see 



3 



Table 1: Absolute latitude, population size, population density, H3N2 sequence sample sizes 
and antigenic residuals for 26 global geographic regions. Antigenic residual estimates are described 
in [3X2] 

Region Absolute latitude Population size Population density H3N2 sample size Antigenic residual 





(degrees) 




(people per km^) 






Africa^ 


14.03 


1.96E+08 


26.67 


10 


-1.18 


Canada^ 


50.67 


3.23E+07 


2.82 


27 


0.18 


Europe 


50.34 


4.84E+08 


96.18 


80 


-0.51 


Indocnma 


13.83 


7.96E+07 


114.72 


59 


-0.30 


New Zealand 


42.03 


4.03E+06 


14.89 


59 


0.07 


Russia 


53.36 


1.47E+08 


7.86 


24 


-0.92 


East China 


28.30 


3.06E+08 


453.29 


56 


0.57 


Mexico 


19.43 


1.03E+08 


52.49 


10 


0.19 


Mideast Japan 


35.71 


4.54E+07 


895.36 


86 


-0.03 


Midwest China 


29.82 


3.08E+08 


159.11 


42 


1.44 


Midwest Japan 


35.18 


3.52E+07 


573.15 


80 


-0.13 


North China 


40.27 


1.24E+08 


308.49 


33 


0.91 


Northeast Australia 


25.35 


4.60E+06 


2.42 


79 


-0.91 


Northeast Japan 


39.43 


1.77E+07 


108.60 


82 


-0.19 


Northeast USA^ 


41.15 


5.45E+07 


129.59 


48 


0.64 


Northwest Australia 


24.72 


2.15E+06 


0.55 


27 


0.17 


West USA5 


40.34 


3.20E+07 


12.13 


74 


0.19 


South America 


15.73 


3.12E+08 


20.71 


57 


-0.57 


West & South Asia'' 


27.17 


1.28E+09 


369.15 


20 


-2.33 


South Australia 


36.77 


1.35E+07 


6.48 


60 


-0.74 


South China 


22.33 


1.02E+08 


563.84 


50 


0.89 


South Korea 


36.17 


4.73E+07 


475.35 


52 


-0.08 


Southeast Asia* 


5.47 


1.19E+08 


169.04 


58 


0.19 


South USA^ 


33.61 


1.16E+08 


51.59 


104 


0.50 


Southwest Japan 


34.02 


2.51E+07 


275.43 


93 


-0.36 


Midwest USA^ 


42.02 


5.30E+07 


44.78 


62 


0.28 



^ includes Algeria, Egypt, Madagascar, South Africa and Saudi Arabia 

includes Canada and Alaska 
^ includes Cambodia and Thailand 

includes Russia and Mongolia 
^ USA is partitioned according to the US census bureau regions 
^ includes India, Nepal and Bangladesh 
^ includes Philippines, Singapore, Malaysia and Guam 



3.3.1 ), we also identified discrete air communities in the worldwide air transportation network 



see 



3.2.1) and applied these as location states to our sequence sample. To increase sequence 



numbers for under-sampled air communities, we also complemented the hemagglutinin gene 
sequences with publicly available sequences from Africa (n = 21), USA (Hawaii, n = 4), 
Central America (n = 13), South America (n = 46) and Canada (n = 10). From this 
data set, we removed six sequences that appeared to be outliers in a root-to-tip divergence 
versus sampling time regression analysis, resulting in a total of 1529 sequences. Within 
each sampling year, we randomly down-sampled the four locations with the highest number 
of samples relative to their population size (USA: from 318 to 120; Oceania: from 225 to 
50; Japan: from 327 to 75; Southeast Asia: from 175 to 100) and analyzed three different 
subsampled data sets discretized according to the 14 air communities. 
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Table 2: Absolute latitude, population size, population density, H3N2 sequence sample and 
sub-sample sizes and antigenic residuals for 15 global geographic regions. 



Region Absolute latitude Population size Population density 
(degrees) (people per km^) 


H3N2 sample size 


Antigenic residual 


Africa 


14.03 


1.96E+08 


26.67 


10 


-1.18 


Canada 


50.67 


3.23E+07 


2.82 


27 


0.18 


Europe 


50.34 


4.84E+08 


96.18 


80 


-0.51 


Indochina 


13.83 


7.96E+07 


114.72 


59 


-0.30 


New Zealand 


42.03 


4.03E+06 


14.89 


59/201 


0.07 


Russia 


53.36 


1.47E+08 


7.86 


24 


-0.92 


China 


29.19 


8.40E+08 


263.06 


181 


0.92 


Japan 


36.01 


1.23E+08 


336.98 


341/751 


-0.18 


Australia 


29.38 


2.02E+07 


2.57 


166/301 


-0.70 


USA 


38.35 


2.56E+08 


39.37 


278/1501 


0.39 


Mexico 


19.43 


1.03E+08 


52.49 


10 


0.19 


South America 


15.73 


3.12E+08 


20.71 


58 


-0.57 


South Asia 


27.17 


1.28E+09 


369.15 


20 


-2.33 


South Korea 


36.17 


4.73E+07 


475.35 


51/301 


-0.08 


Southeast Asia 


5.47 


1.19E+08 


169.04 


58 


0.19 



^ Sample sizes arc provided before/after down-sampling 



Table 3: Absolute latitude, population size, population density, H3N2 sequence sample and 
sub-sample sizes and antigenic residuals for 14 global air communities. 



Region 


Absolute latitude 
(degrees) 


Population size 


Population density 
(people per km^) 


H3N2 sample size 


Antigenic residual 


Africa 


22.27 


8.03E+07 


44.53 


23 


-0.93 


USA 


37.33 


2.95E+08 


32.21 


318/1201 


0.34 


Taiwan 


25.04 


2.28E-f07 


629.54 


17 


0.25 


China 


32.42 


1.29E-h09 


134.88 


122 


0.88 


Russia 


55.60 


1.44E+08 


8.44 


17 


-1.05 


Oceania 


32.69 


2.46E+07 


3.06 


225/501 


-0.51 


West & South Asia 


27.61 


1.38E+09 


208.34 


26 


-0.11 


Japan 


36.03 


1.28EH-08 


342.62 


327/751 


-0.22 


Mexico 


19.65 


1.03E-h08 


52.49 


12 


0.46 


South America 


18.02 


3.12E+08 


20.71 


101 


-0.32 


Canada 


48.53 


3.16E+07 


3.17 


24 


0.11 


Europe 


48.44 


4.84E+08 


96.18 


85 


-0.58 


Southeast Asia 


15.27 


2.06E+08 


69.71 


175/1001 


0.10 


South Korea 


36.01 


4.73E+07 


475.35 


46 


-0.04 



^ Sample sizes are provided before/after down-sampling 



3.2 Air transportation data and identification of air transporta- 
tion communities 

3.2.1 Passenger flux data 

The worldwide air transportation network is defined by a passenger flux matrix that quan- 
tifles the number of passengers travehng between each pair of airports. We use a dataset 
provided by OAG (Official Airline Guide) Ltd. ( |http : / / www . oag . coin[ ) , containing 4,092 
airports and the number of seats on scheduled commercial flights between pairs of airports 
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during the years 2004-2006. The number of seats on scheduled commercial flights from air- 
port i to j is given by "^ij, which we take to be proportional to the number of passengers 
traveling. For the 26 location scheme, we summarized the number of passengers from the full 
aviation network for each pair of locations based on all the airports in the respective regions. 
To facilitate the identiflcation of 14 air communities, we focused on a 1227-largest-airport 
network that represented 95% of the passenger flux of the full aviation network by excluding 
the lowest contributing airports. By focusing on this subset of largest airports, we exclude 
a large number of very small community airports; details and plausibility of this reduction 
are discussed in [19]. 

3.2.2 Modularity maximization 

To identify air transportation communities, we approximate a maximal-modularity sub- 
division of the 1227-largest-airport network by employing a recently described stochastic 
Monte-Carlo approach [SU], a generalization of the method introduced in [5T]. Modularity 
provides a measure of how well the connectivity of a network is described by partitioning its 
nodes into non-overlapping groups. For any given partition, modularity will be high if con- 
nectivity within groups is high and connectivity among groups is low. In large networks, it 
is generally computationally impossible to flnd the optimal subdivision. To approximate the 
optimum a variety of approximative methods have been introduced. The method we employ 
here generates an ensemble of high modularity subdivisions and computes the consensus in 
this ensemble by superposition, for details see [20| [22]. 

For an ensemble of 1000 modularity subdivisions we quantify the uncertainty by an affin- 
ity matrix that, for each pair of locations, summarizes the fraction of partitions in which 
these locations are in the same community. Based on a partition encompassing a reasonably 
large number of air communities [n = 14), we subsequently obtain the average affinity for 
each airport to the communities in this partition. We assign each airport to the commu- 
nity for which it shows the highest average affinity, but we take into account its uncertainty 
by also considering assignments that yield affinities that are > 2/3 of the highest affinity 
score. This cut-off resulted in 771 ambiguous airport assignments. Finally, we partitioned 
the sequence data according to the air community assignment and accommodate 368 (24%) 
ambiguous sequence locations, i.e. those sequences related to airports with ambiguous com- 
munity assignments, using ambiguity coding in our phylogeographic approach. 

3.3 Bayesian statistical analysis of sequence and trait evolution 

We integrate genetic, spatial and air transportation data within a single full probabilistic 
evolutionary model and simultaneously estimate the parameters of phylogeographic diffu- 
sion using Markov chain Monte Carlo (MCMC) analysis implemented in BEAST |23]. We 
introduce novel models and inference procedures in the section below. To model sequence 
evolution, we partition the hemagglutinin codon positions into first -|-second and third po- 
sitions [21] and apply a separate HKY85 [25] CTMC model of nucleotide substitution with 
discrete gamma-distributed rate variation [26] to both. We assume a flexible Bayesian skyride 
prior over the unknown phylogeny [27]. Exploratory runs using the data for the 26 locations 
indicated that a relaxed molecular clock represented an over-parametrization [2H]- A strict 
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clock was therefore used in subsequent analyses. Because the exact date of sampling was not 
known for some additional publicly available sequences, we integrated out their dates over 
the known sampling time interval [29]. We capitalize on BEAGLE |30| in conjunction with 
BEAST to improve computational performance on our large data sets. MCMC analyses 
were run sufficiently long to ensure stationarity as diagnosed using Tracer. We used the 
TreeAnnotator tool in BEAST to summarize trees in the form of maximum clade credibility 
(MCC) trees. 



3.3.1 Phylogeographic inference and hypothesis testing 

We develop a novel model-based approach to simultaneously reconstruct spatiotemporal 
history and test the contribution of potential predictors of spatial diffusion. This approach 
builds on recently developed Bayesian phylogeographic inference methods to simultaneously 
reconstruct phylogenetic history and discretized diffusion processes [31]. These processes 
are modeled as continuous-time Markov chain processes parameterized in terms of a x 
K infinitesimal rate matrix of discrete location change (A). Here, we extend this model by 
adopting a generalized linear model (GLM) approach that considers every rate of movement 
(Ajj) in A as a log linear function of an arbitrary number of predictors X, such that: 



logkij = l3i5ilog{p^) + ^2S2log{p2) + ... + ^nSJog{pJ, (1) 

for n predictors, where /3 represents the effective size for the predictor log p, quantifying 
its contribution to Ajj, and S is an (O,l)-indicator variable that governs the inclusion or 
exclusion of the predictor in the model. The incorporation of indicator variables allows 
Bayesian stochastic search variable selection (BSSVS) [321 [33], which estimates the poste- 
rior probabilities of all possible linear models that may or may not include the predictors. 
When an indicator equals 1, this predictor is included in the model, demonstrating that it 
helps to explain the diffusion process in the phylogenetic history with high probability. We 
complete this GLM specification with variable selection by assigning independent Bernoulli 
prior probability distributions on 6, effectively placing equal probability on each predictors 
inclusion and exclusion. Lemey et al. (2009) [^ discuss BSSVS in further detail and anal- 
ogous to Edo-matas et al. (2011) [33], we can use Bayes factors (BFs) [3S], ES] to express 
how much the data change our prior opinion about the inclusion of each predictor. These 
BFs are calculated by dividing the posterior odds for the inclusion of a predictor with the 
corresponding prior odds (here, 1:1 odds). 

BFi = ^/^, (2) 

I- Pi l-Qi 

where Pi is the posterior probability that the predictor is included, in this case the posterior 
expectation of indicator 5j, and is the prior probability that 5i = 1. We specify that a 
priori the /3's are normally distributed with mean and a variance of 4. We implement the 
GLM-diffusion parametrization in the software package BEAST [37] and approximate the 
joint posterior and its marginalizations using standard Markov chain Monte Carlo (MCMC) 
transition kernels. An important, novel extension to the standard MCMC machinery in 
BEAST lies in generating an efficient Metropolis-Hastings proposal distribution for the GLM 
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coefficients (3. Given the potential for high correlation between predictors X, attempting 
to update one coefficient /3j at a time while holding the remaining constant returns high 
autocorrelations times. Instead, we exploit the fixed correlation structure X'X between 
predictors to generate a multivariate proposal f3*. In particular, if we assume f3 are the 
current realized values, then we draw 

(3'' ~ Muhivariate-Normal (^(3, a (X'X)"^ j , (3) 

where a is an auto-tunable variance scalar. Motivation for this proposal stems from imag- 
ining that the marginal posterior distribution of f3 under our phylogenetic GLM should 
partially approximate a simple linear regression model involving /3, whose posterior variance 
is proportional to X'X. We consider a 'bit-flip' operator on the Bernoulli rate indicators; 
this transition kernel is further discussed in |38] . 

We extended the phylogeographic inference techniques to take into account ambiguous 
location states in order to accommodate the uncertainty of the modularity maximization 



procedure in assigning airports to distinct discrete air communities (see 3.2.2) 



3.3.2 GLM-diffusion predictors 

Depending on the location state partitioning scheme, we considered several potential predic- 
tors of global influenza diffusion: 

• Average and minimum distance. To test whether geographical proximity predicts 
influenza diffusion we considered two different great-circle distance measures: (i) the 
average distance between two locations based on the pairwise distances between all 
pairs of airports from the two locations and (ii) the minimum distance amongst those 
pairwise distances. 

• Absolute latitude. Absolute latitudes for each region/community were calculated 
as the absolute values of the average latitudes of the sequence sampling locations (Se- 
quences from unknown locations within speciflc countries were assigned to the capital 
of that country) and are hsted in Tables [T| |2] and [3] 

• Passenger flux. The total number of passengers traveling between each pair of loca- 



tions per day (see 3.2.1). 



Population size and density. Demographic estimates obtained from www. citypopulation. 

de or Geographica [39j are hsted in Tables [l| [2] and [3j Origin and destination popula- 
tion sizes/densities were included as separate predictors. 

Viral surveillance data. To test the predictive power of viral surveillance data, we 
essentially aimed at capturing the nature and degree of synchronicity of yearly incidence 
proflles in each region. To this purpose, we extracted the number of influenza viruses 
A(H3) detected per country from week 1 in 1997 to week 45 in 2010 from FluNet/WHO 
(www.who.int/flunet) for relevant countries in the discrete partition schemes. Tai- 
wanese surveillance data was obtained from |3D]. We focused on the influenza A(H3) 
incidence counts between 2002 - 2007 or as close as possible to this time period when 
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Figure 1: H3 incidence profiles for the 14 air communities based on both raw and smoothed, 
normalized, weekly average counts for relevant countries in the air communities. For the following 

regions we include data from specific countries: Africa (South Africa), Oceania (Australia), South & West 
Asia (Bangladesh and India), South America (Argentina, Brazil, Chile, Ecuador, Paraguay, French Guiana, 
Peru, Uruguay and Venezuela), Europe (Bulgaria, Croatia, Czech Republic, Denmark, Finland, France, 
Germany, Greece, Iceland, Ireland, Italy, Latvia, Netherlands, Norway, Romania, Slovenia, Spain, Sweden, 
Switzerland, Turkey, Ukraine, UK and Ireland,) and Southeast Asia (Malaysia, Philippines, Singapore and 
Thailand). 



insufficient data was available. Average incidence counts were used when data from 
multiple countries per region/community was available. We subsequently calculated 
average incidences per week across multiple years for each region/community, normal- 
ized these weekly averages and smoothed them with a Gaussian standard deviation of 
2 weeks. Figure [T] depicts the resulting incidence profiles for the 14 air communities. 

We derived several potential predictors from these incidence profiles, including inci- 
dence overlap, origin incidence versus destination growth rate, peak time difference, 
and incidence in the origin location at fixed times prior to peak incidence in the des- 
tination location. The incidence overlap summarizes the overlapping area under the 
origin-destination incidence curves for each pair of locations. The origin incidence ver- 
sus destination growth rate sums the product of origin incidence and destination growth 
rate for each week of the year. Peak time difference quantifies the difference in peak 
incidence for each origin-destination pair. For the latter, we summarized the donor 
incidences at 4, 8, 12, 16, 20 and 24 weeks prior to peak incidence in the destination 
location as potential predictors. 
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Figure 2: Scatter plot with local regression (LOESS) fit for the first principle component 
(PCI) of the H3N2 antigenic measurements against time. Antigenic PCI data points arc colored 
according to the air communities represented in Fig. [3j 

• Antigenic evolution. Because antigenic evolution can provide insights into the seed- 
ing dynamics of seasonal H3N2 |3], we sought to include the average antigenic diver- 
gence for each region as phylogeographic diffusion predictors. Based on the available 
antigenic cartography data for the strains in our phylogeographic analyses, we per- 
formed a local regression (LOESS) of the principle antigenic component, obtained 
from a multidimensional scaling analysis of hemagglutination inhibition assay mea- 
surements [3], against time. The resulting scatter plot with strains colored according 
to air community is presented in Figure |2} Distances from the spline (residuals) were 
calculated for each antigenic measurement and average residuals were obtained for each 
region/community, which reflects whether a location is on average antigenically leading 
or trailing [3J • These average residuals are listed in Tables [T| [2] and [sj We considered 
the exponentiated residual and exponentiated negative residual as a measure of efflux 
and influx respectively for each location and included these as separate origin and 
destination predictors. 

• Sample sizes. To test the impact of sampling effects, we considered origin and des- 
tination sample sizes (number of H3N2 sequences included per discrete location state 
in the phylogeographic analysis) as separate predictors. Although sampling sizes are 
expected to have an impact on the number of location transitions, support for other 
factors in addition to sampling size predictors may suggest that they are robust to 
potential sampling biases. 
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All predictors were transformed to log space and standardized prior to their incorporation 
in the GLM approach. 

3.3.3 Fitting diffusion models to empirical tree distributions 

Although we generally desire to simultaneously reconstruct sequence and discrete/continuous 
trait evolution using our Bayesian statistical framework, integrating over tree-space becomes 
a computationally daunting task for a large number of taxa. The main limiting factor in 
Bayesian MCMC analysis of evolutionary history is typically the efficiency with which topol- 
ogy proposals sample phylogenetic tree space [41j. To side-step these limitations and reduce 
time to convergence, we seek to approximate phylogenetic uncertainty in our phylogeographic 
estimates in cases where sampling tree space needs to be performed repeatedly (e.g. when 
comparing different diffusion models). To this purpose, we follow [42| and implement pro- 
posal mechanisms to randomly draw from an empirical posterior distribution of trees, which, 
in our case, were solely inferred from sequence data. Because the likelihood of a tree topol- 
ogy will largely be dominated by an informative sequence alignment compared to a single 
discrete (location) site, we expect such an empirical distribution to closely approximate the 
phylogenetic uncertainty in the joint inference approach. 

4 Results 

To identify key factors in seasonal influenza dispersal, we inferred the phylogeographic history 
of globally sampled A/H3N2 viruses between 2002 and 2007, while simultaneously evaluat- 
ing the contribution of several potential diffusion predictors using a novel Bayesian model 
selection procedure. Our approach draws from recent developments in stochastic phyloge- 
netic diffusion models [31] , and extends these by parameterizing pairwise diffusion rates as a 
function of a number of potential predictors, distinguishing between the statistical support 
for a predictor and the magnitude of its effect (see Methods). We discretized the global sam- 
pling locations into 26 and 15 geographically deflned regions (Table [l] and [2] respectively) 
as well as into 14 distinct air travel communities (Table [3]). To identify this community 
structure in global air travel, we determined partitions with high intra-module connectivity 
and low inter-module connectivity in a passenger flux network of 1227 airports. Although 
this approach is blind to the geographic locations of the airports, the global air communities 
are spatially compact with few exceptions (Figure [3]). 

We compared a panel of possible predictors of phylogeographic diffusion using a gen- 
eralized linear model (GLM) approach (see Methods). Here, we considered geography, air 
travel (in the form of the number of passengers traveling between each pair of locations), de- 
mography, viral surveillance data, viral phenotypic evolution and sampling sizes as possible 
explanatory variables (Figure |4]). 

Our phylogeographic test does not attribute any importance to geographical proximity, 
absolute latitude (to model source-sink behavior between the tropics and Northern/Southern 
hemisphere, [IS]), population size and density, antigenic divergence or H3 incidence for the 
14 air communities and the 15 geographic regions. Instead, we provide consistent and over- 
whelming evidence for passenger flow driving the global H3N2 diffusion dynamics, as reflected 
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Figure 3: 14 global air communities identified through a modularity maximization analyses of 
air transportation data. The colored dots represent the airports in each community for which passenger 
flux data was used in the analysis. The areas with corresponding colors represent the geographical area within 
the communities for which H3N2 sequence samples were available. The 14 communities and associated data 
are listed in Tabled 



in the posterior support and a conditional effect size close to unity on a log scale for both 
the air community and geographic partitions. Despite efforts to down-sample presumably 
oversampled regions or communities (see Methods), there is still a role for sample size in both 
the air community and geographic partitions. Explicitly including sample sizes as diffusion 
predictors allows us to absorb potential effects of sampling bias, offering more credibility 
for other predictors that are included in the model. When applied to 26 geographic regions, 
which further partitions geographically and administratively coherent regions like US, China, 
Japan and Australia (Table [T]), our model also takes an important negative contribution from 
geographic distance and, correlated with this but with lower coefficients, population densi- 
ties. Here, distance most likely represents the role of other human mobility processes such 
as commuting, which has been shown to play a key role in the spread of influenza in the US 
[H]. The negative population density effect may suggest that commuting is to some extent 
less likely to occur out of or into dense subpopulations. 



5 Summary 

Influenza prevention and control critically relies on our understanding of its geographical 
transmission patterns. Here, we demonstrate the ability to jointly reconstruct phylogeo- 
graphic history while identifying the factors that contribute epidemic spread from viral ge- 
netic data. Our analysis of global influenza transmission provides evidence for the key role 
of air travel, which is highly intuitive and has long been predicted by modeling studies (e.g. 
[3] ), but remained difficult to ascertain from empirical data. The predictors of influenza 
diffusion will undoubtedly be scale-dependent as indicated by the role of geographic distance 
within more confined geographic areas (Figure 111), and this may represent other forms of 
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Figure 4: Predictors of global H3N2 diffusion among the 14 air communities and the 15 Sz 26 
geographic locations. The inclusion probabilities are defined by the indicator expectations E[S] because 
they reflect the frequency by which the predictor is included in the model and therefore represent the support 
for the predictor. Bayes factor support values corresponding to two different indicator expectations, indicated 
a the bottom of the plots, are shown as vertical lines. The contribution of each predictor, when included in 
the model {/3\S — 1), is represented by the mean and credible intervals of the GLM coefficients on a log scale. 
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human mobility such as commuting [H], which can be tested in future apphcations. More 
generally, our novel phylogenetic diffusion approach may be applied to different infectious 
diseases problems and provide entirely new opportunities for testing how host ecology shapes 
the distribution of pathogen genetic data. 
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